We start by bringing in the Python libraries that we will be using in this code. We will be using PyTorch, which enables us to run the code on the GPU for faster processing compared to the CPU. PyTorch also assists us during backpropagation through autograd, which performs the differentiation and stores the gradients so they can be retrieved when required. PyTorch also makes it very easy to create neural networks, and together with libraries such as NumPy, Pandas and SciPy it becomes very powerful for neural network calculations.

Neural Networks

Neural networks are the building blocks of deep learning. They are made up of neurons, or units. Each unit computes a weighted sum of the outputs of the previous layer and passes that sum through an activation function to get the neuron's output. Some activation functions used in this code are sigmoid, ReLU (Rectified Linear Unit), softmax and tanh (hyperbolic tangent), just to name a few.

A simple neural network can be described mathematically as shown below:

$$ \begin{align} y &= f(w_1 x_1 + w_2 x_2 + w_3 x_3 + b) \\ y &= f\left(\sum_i w_i x_i +b \right) \end{align} $$

The linear transformation done on each unit is given by the dot product of the input vector and the weight matrix:

$$ h = \begin{bmatrix} x_1 \, x_2 \cdots x_n \end{bmatrix} \cdot \begin{bmatrix} w_{11} & w_{21}\\ w_{12} & w_{22}\\ \vdots & \vdots \\ w_{1n} & w_{2n} \end{bmatrix} $$
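
As a minimal sketch with made-up numbers (three inputs, two hidden units), this matrix product can be written directly in PyTorch:

import torch

# Hypothetical input (1 x 3) and weight matrix (3 x 2): each hidden unit is a
# weighted sum of the three inputs plus a bias.
x = torch.tensor([[0.5, -0.2, 0.1]])
W = torch.tensor([[0.1, 0.4],
                  [0.2, 0.5],
                  [0.3, 0.6]])
b = torch.tensor([0.05, -0.05])

h = torch.matmul(x, W) + b   # the dot product shown above
print(h)                     # tensor of shape (1, 2)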

We begin by importing the libraries that we will be using in our code:

  • torch - imports the torch package, which will be used to create the neural network layers, weights and biases, as well as to perform backpropagation using autograd.
  • nn - this module makes it easy to build the neural network.
  • optim - used to optimize the neural network by adjusting the weights towards the values that give us the intended labels.
In [26]:
import torch
from torch import nn, optim
import numpy as np

We're going to build a small neural network with a single training example, as shown below. Our goal is to build a network that produces the correct output [0.0, 1.0, 0.0].

In [27]:
x = torch.tensor([0.1,0.2,0.7])
x = x.reshape(1,3)
print(x.shape)
print(x)
torch.Size([1, 3])
tensor([[0.1000, 0.2000, 0.7000]])

Our output will be given by the vector below:

In [28]:
labels = torch.tensor([0,1,0])
labels = labels.reshape(1,3)
print(labels.shape)
print(labels)
torch.Size([1, 3])
tensor([[0, 1, 0]])

Here is where the neural network is made and all the fun begins! I will go through the code line by line.

class Network(nn.Module):

Using nn.Module combined with super().__init__() creates a class that tracks the neural network's methods and attributes for us. The class name can be anything, but it is mandatory to inherit from nn.Module.

self.hidden = nn.Linear(3,3)

This line creates the first hidden layer. nn.Linear(3,3) creates a linear transformation with 3 inputs and 3 outputs, $x\mathbf{W}^T + b$ (PyTorch stores the weight as a 3 X 3 tensor with shape out_features x in_features), and assigns it to self.hidden. This module also creates the weight and bias tensors, which will be used during the feedforward in the forward method.
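
As a quick standalone check of what nn.Linear(3, 3) creates (the values themselves are randomly initialised here):

import torch
from torch import nn

layer = nn.Linear(3, 3)          # 3 inputs -> 3 outputs
print(layer.weight.shape)        # torch.Size([3, 3])
print(layer.bias.shape)          # torch.Size([3])

x = torch.tensor([[0.1, 0.2, 0.7]])
print(layer(x).shape)            # torch.Size([1, 3])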

The subsequent line, self.hidden2 = nn.Linear(3,3), performs the same function for the second hidden layer. Here we generate another 3 X 3 weight tensor. The linear transformation creates the weight and bias tensors and assigns them to self.hidden2; these will later be used in the forward method.

self.output = nn.Linear(3,3)

This creates a 3 X 3 linear transformation for the output layer, which generates the output of the neural network. Its result is then passed through the softmax activation function in the output layer.

self.hidden.weight = torch.nn.Parameter(torch.tensor([[0.1,0.3,0.4],[0.2,0.2,0.3],[0.3,0.7,0.9]]))

This line of code sets the exact weights that will be used in the code. This was done to enable a comparison between the code and the hand calculations that were already performed.

The self.hidden2.weight and self.output.weight tensors are set explicitly in the same way, to define the weights for the second hidden layer and the output layer of the neural network.
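
In this notebook the biases are left at their randomly initialised values. If you also wanted to pin them for a hand calculation, they can be set the same way; here is a standalone sketch with made-up zero biases on a separate layer:

import torch
from torch import nn

layer = nn.Linear(3, 3)
# Hypothetical values: fix the bias explicitly, just as the weights are fixed above,
# so the code and the hand calculation use identical numbers.
layer.bias = torch.nn.Parameter(torch.tensor([0.0, 0.0, 0.0]))
print(layer.bias)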

self.sigmoid = nn.Sigmoid()
self.relu = nn.ReLU()
self.softmax = nn.Softmax(dim=1)

These operations are the sigmoid, Rectified Linear Unit (ReLU) and softmax activation functions. The softmax is used in the output layer, where we generate probabilities for the 3 output classes. dim=1 ensures that the summation is done across the columns (over the classes of each example), as opposed to across the rows. The softmax outputs always sum to 1, since they form a probability distribution.

Some things to note :

  • The softmax activation function can ONLY be used in the output layer.
  • The softmax function and the logistic/sigmoid function yield the same results when the number of classes is 2, so softmax is a generalization of the sigmoid function (see the quick check below).
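
A quick standalone check (with made-up scores) of the dim=1 behaviour and of the two-class equivalence:

import torch
from torch import nn

softmax = nn.Softmax(dim=1)

# dim=1: the probabilities in each row (each example) sum to 1.
z = torch.tensor([[1.0, 2.0, 0.5]])
p = softmax(z)
print(p, p.sum(dim=1))   # probabilities and tensor([1.])

# With two classes, softmax gives the same result as the sigmoid of the
# difference between the two scores.
z2 = torch.tensor([[1.5, -0.5]])
print(softmax(z2)[0, 0])                    # probability of class 0
print(torch.sigmoid(z2[0, 0] - z2[0, 1]))   # identical value
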
def forward(self, x):

PyTorch neural networks require a forward method to be defined in order to perform the feedforward step. This is achieved by taking a tensor x and passing it through all the operations defined in the __init__ method.

x = self.hidden(x)

x = self.relu(x)

x = self.hidden2(x)

x = self.sigmoid(x)

x = self.output(x)

x = self.softmax(x)

Here we pass the input tensor x through the first hidden layer, the ReLU activation, the second hidden layer, the sigmoid activation, the output layer and finally the softmax function. The operations must be listed in the correct order in the forward method so that they are applied sequentially as expected.

In [29]:
class Network(nn.Module):
    def __init__(self):
        super().__init__()
        
        # Inputs to hidden layer linear transformation
        self.hidden = nn.Linear(3, 3)
        self.hidden2 = nn.Linear(3, 3)
        self.output = nn.Linear(3, 3)
        
        self.hidden.weight = torch.nn.Parameter(torch.tensor([[0.1,0.3,0.4],[0.2,0.2,0.3],[0.3,0.7,0.9]]))
        self.hidden2.weight = torch.nn.Parameter(torch.tensor([[0.2,0.3,0.6],[0.3,0.5,0.4],[0.5,0.7,0.8]]))
        self.output.weight = torch.nn.Parameter(torch.tensor([[0.1,0.3,0.5],[0.4,0.7,0.2],[0.8,0.2,0.9]]))
       
        
        
        # Define sigmoid activation, ReLU and softmax output 
        self.sigmoid = nn.Sigmoid()
        self.relu = nn.ReLU()
        self.softmax = nn.Softmax(dim=1)
        
    def forward(self, x):
        # Pass the input tensor through each of our operations
        x = self.hidden(x)
    
        x = self.relu(x)
       
        x = self.hidden2(x)
        
        x = self.sigmoid(x)
        
        x = self.output(x)
        
        x = self.softmax(x)
        
        return x

    
    

We can now create a Network object with model = Network().
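
For example, instantiating the network and printing it lists the modules registered in __init__ (a quick sketch; the actual model used for training is created in the training cell below):

model = Network()
print(model)   # shows hidden, hidden2, output and the activation modules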

An epoch in machine learning is one complete pass through the training data; we run multiple epochs to optimize the weights and reduce the loss. For this neural network, we will use 50 epochs.

The next step after the feedforward is the calculation of the loss of the neural network. We will be using the nn.CrossEntropyLoss() criterion. This criterion is a combination of nn.LogSoftmax() and nn.NLLLoss(). NLLLoss stands for Negative Log Likelihood Loss; it is used when a softmax (or log-softmax) is applied in the output layer of the neural network. This loss maximizes the likelihood of the correct label while pushing the predicted probabilities of the other classes towards 0. Note that nn.CrossEntropyLoss() applies the log-softmax internally and therefore expects raw scores (logits); because our forward method already ends with a softmax, the softmax is effectively applied twice here, which is why the loss below plateaus around 0.55 instead of approaching 0 even though the predictions become essentially perfect.
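
As a small standalone check (with made-up scores) that nn.CrossEntropyLoss() really is nn.LogSoftmax() followed by nn.NLLLoss():

import torch
from torch import nn

logits = torch.tensor([[0.2, 1.5, -0.3]])   # raw scores for 3 classes
target = torch.tensor([1])                  # index of the correct class

ce = nn.CrossEntropyLoss()(logits, target)

log_probs = nn.LogSoftmax(dim=1)(logits)
nll = nn.NLLLoss()(log_probs, target)

print(ce, nll)   # the two losses match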

loss.backward() performs the backpropagation and calculates the gradient of the loss with respect to each weight and bias.

PyTorch has a module, autograd, that enables us to automatically calculate the gradients of tensors. We can use it to calculate the gradients of the loss with respect to all of our parameters. Autograd tracks the operations performed during the feedforward, then calculates the gradients during backpropagation. Autograd essentially performs differentiation using the chain rule from calculus. During backpropagation, it is important to ensure that gradient tracking is turned on by setting requires_grad = True on a tensor. This can be done at tensor creation with the requires_grad keyword, or at any time with y.requires_grad_(True).

Gradient tracking can be turned off in a block of code using torch.no_grad(), or globally for the entire code using torch.set_grad_enabled(True|False).
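
A minimal sketch of both approaches:

import torch

a = torch.randn(2, 2, requires_grad=True)

with torch.no_grad():
    b = a * 2
print(b.requires_grad)   # False -- no operations tracked inside the block

torch.set_grad_enabled(False)   # turn tracking off globally
c = a * 2
print(c.requires_grad)   # False
torch.set_grad_enabled(True)    # turn it back on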

To train the network, we use the optim package. For this network, we use the optim.Adam optimizer. There are several others we could use, such as optim.SGD, where the optimizer uses stochastic gradient descent. It is critical to zero out the gradients in each epoch using optimizer.zero_grad() to ensure that gradients do not accumulate across training steps. For this network, a learning rate of lr=0.09 was adequate to reach a minimum as quickly as possible.
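
To see why the zeroing matters, here is a tiny standalone sketch showing that gradients accumulate across backward passes unless they are cleared:

import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

loss = (w * 3).sum()
loss.backward()
print(w.grad)      # tensor([3., 3.])

loss = (w * 3).sum()
loss.backward()    # without zeroing, the new gradients are added to the old ones
print(w.grad)      # tensor([6., 6.])

w.grad.zero_()     # what optimizer.zero_grad() does for every parameter
print(w.grad)      # tensor([0., 0.])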

In [31]:
model = Network()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.09)
epochs = 50

for t in range(1, epochs):
    optimizer.zero_grad()                              # clear accumulated gradients
    logps = model(x)                                   # feedforward
    loss = criterion(logps, torch.max(labels, 1)[1])   # target is the index of the correct class
    loss.backward()                                    # backpropagation
    optimizer.step()                                   # update the weights
    print(labels)
    print(loss)
    print(model.forward(x))
tensor([[0, 1, 0]])
tensor(1.2942, grad_fn=<NllLossBackward>)
tensor([[0.1865, 0.2612, 0.5523]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(1.1835, grad_fn=<NllLossBackward>)
tensor([[0.1551, 0.3949, 0.4499]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(1.0450, grad_fn=<NllLossBackward>)
tensor([[0.1093, 0.5608, 0.3300]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(0.8881, grad_fn=<NllLossBackward>)
tensor([[0.0671, 0.7224, 0.2105]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(0.7508, grad_fn=<NllLossBackward>)
tensor([[0.0349, 0.8473, 0.1178]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(0.6554, grad_fn=<NllLossBackward>)
tensor([[0.0166, 0.9239, 0.0595]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(0.6015, grad_fn=<NllLossBackward>)
tensor([[0.0076, 0.9638, 0.0286]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(0.5748, grad_fn=<NllLossBackward>)
tensor([[0.0035, 0.9830, 0.0135]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(0.5624, grad_fn=<NllLossBackward>)
tensor([[0.0016, 0.9919, 0.0064]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(0.5566, grad_fn=<NllLossBackward>)
tensor([[7.7343e-04, 9.9615e-01, 3.0752e-03]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(0.5539, grad_fn=<NllLossBackward>)
tensor([[3.7108e-04, 9.9815e-01, 1.4820e-03]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(0.5526, grad_fn=<NllLossBackward>)
tensor([[1.7934e-04, 9.9910e-01, 7.1770e-04]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(0.5520, grad_fn=<NllLossBackward>)
tensor([[8.7016e-05, 9.9956e-01, 3.4852e-04]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(0.5517, grad_fn=<NllLossBackward>)
tensor([[4.2303e-05, 9.9979e-01, 1.6948e-04]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(0.5516, grad_fn=<NllLossBackward>)
tensor([[2.0584e-05, 9.9990e-01, 8.2469e-05]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(0.5515, grad_fn=<NllLossBackward>)
tensor([[1.0021e-05, 9.9995e-01, 4.0143e-05]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(0.5515, grad_fn=<NllLossBackward>)
tensor([[4.8810e-06, 9.9998e-01, 1.9544e-05]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(0.5515, grad_fn=<NllLossBackward>)
tensor([[2.3791e-06, 9.9999e-01, 9.5181e-06]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(0.5515, grad_fn=<NllLossBackward>)
tensor([[1.1613e-06, 9.9999e-01, 4.6379e-06]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(0.5514, grad_fn=<NllLossBackward>)
tensor([[5.6857e-07, 1.0000e+00, 2.2625e-06]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(0.5514, grad_fn=<NllLossBackward>)
tensor([[2.8000e-07, 1.0000e+00, 1.1062e-06]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(0.5514, grad_fn=<NllLossBackward>)
tensor([[1.3950e-07, 1.0000e+00, 5.4340e-07]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(0.5514, grad_fn=<NllLossBackward>)
tensor([[7.0981e-08, 1.0000e+00, 2.6930e-07]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(0.5514, grad_fn=<NllLossBackward>)
tensor([[3.7420e-08, 1.0000e+00, 1.3567e-07]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(0.5514, grad_fn=<NllLossBackward>)
tensor([[2.1009e-08, 1.0000e+00, 7.1081e-08]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(0.5514, grad_fn=<NllLossBackward>)
tensor([[1.2370e-08, 1.0000e+00, 3.8293e-08]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(0.5514, grad_fn=<NllLossBackward>)
tensor([[1.0557e-08, 1.0000e+00, 2.9668e-08]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(0.5514, grad_fn=<NllLossBackward>)
tensor([[9.1359e-09, 1.0000e+00, 2.3448e-08]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(0.5514, grad_fn=<NllLossBackward>)
tensor([[8.0040e-09, 1.0000e+00, 1.8903e-08]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(0.5514, grad_fn=<NllLossBackward>)
tensor([[7.0888e-09, 1.0000e+00, 1.5531e-08]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(0.5514, grad_fn=<NllLossBackward>)
tensor([[6.3387e-09, 1.0000e+00, 1.2987e-08]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(0.5514, grad_fn=<NllLossBackward>)
tensor([[5.7160e-09, 1.0000e+00, 1.1036e-08]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(0.5514, grad_fn=<NllLossBackward>)
tensor([[5.1930e-09, 1.0000e+00, 9.5134e-09]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(0.5514, grad_fn=<NllLossBackward>)
tensor([[4.7490e-09, 1.0000e+00, 8.3063e-09]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(0.5514, grad_fn=<NllLossBackward>)
tensor([[4.3686e-09, 1.0000e+00, 7.3346e-09]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(0.5514, grad_fn=<NllLossBackward>)
tensor([[4.0398e-09, 1.0000e+00, 6.5411e-09]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(0.5514, grad_fn=<NllLossBackward>)
tensor([[3.7533e-09, 1.0000e+00, 5.8848e-09]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(0.5514, grad_fn=<NllLossBackward>)
tensor([[3.5018e-09, 1.0000e+00, 5.3353e-09]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(0.5514, grad_fn=<NllLossBackward>)
tensor([[3.2796e-09, 1.0000e+00, 4.8703e-09]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(0.5514, grad_fn=<NllLossBackward>)
tensor([[3.0820e-09, 1.0000e+00, 4.4729e-09]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(0.5514, grad_fn=<NllLossBackward>)
tensor([[2.9055e-09, 1.0000e+00, 4.1302e-09]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(0.5514, grad_fn=<NllLossBackward>)
tensor([[2.7469e-09, 1.0000e+00, 3.8322e-09]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(0.5514, grad_fn=<NllLossBackward>)
tensor([[2.6037e-09, 1.0000e+00, 3.5713e-09]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(0.5514, grad_fn=<NllLossBackward>)
tensor([[2.4739e-09, 1.0000e+00, 3.3411e-09]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(0.5514, grad_fn=<NllLossBackward>)
tensor([[2.3558e-09, 1.0000e+00, 3.1369e-09]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(0.5514, grad_fn=<NllLossBackward>)
tensor([[2.2479e-09, 1.0000e+00, 2.9546e-09]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(0.5514, grad_fn=<NllLossBackward>)
tensor([[2.1489e-09, 1.0000e+00, 2.7911e-09]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(0.5514, grad_fn=<NllLossBackward>)
tensor([[2.0579e-09, 1.0000e+00, 2.6437e-09]], grad_fn=<SoftmaxBackward>)
tensor([[0, 1, 0]])
tensor(0.5514, grad_fn=<NllLossBackward>)
tensor([[1.9740e-09, 1.0000e+00, 2.5102e-09]], grad_fn=<SoftmaxBackward>)

Below I take you through a quick example showing how autograd calculates gradients and performs backpropagation.

In [19]:
a = torch.randn(2,2, requires_grad=True)
print(a)
tensor([[0.3183, 1.0871],
        [1.7827, 1.3820]], requires_grad=True)
In [20]:
y = a**3
print(y)
tensor([[0.0323, 1.2847],
        [5.6654, 2.6394]], grad_fn=<PowBackward0>)

Here we see that y is created with a power operation, PowBackward0.

In [21]:
m = y**2
print(m)
tensor([[1.0407e-03, 1.6503e+00],
        [3.2096e+01, 6.9662e+00]], grad_fn=<PowBackward0>)

y then goes through another power operation, where it is squared.

In [22]:
print(m.grad_fn)
<PowBackward0 object at 0x11b5e2c10>

Autograd tracks all the operations and calculates the gradients for each operation during backpropagation. Next we take a mean operation to get a scalar value from m.

In [23]:
z = m.mean()
print(z)
tensor(10.1785, grad_fn=<MeanBackward0>)
In [24]:
print(a.grad)
None

We can see that there are no gradients yet, since no backpropagation has been done at this point. Calling z.backward() gives exactly the same result as differentiating by hand with respect to a: $$ \frac{\partial z}{\partial a_i} = \frac{\partial}{\partial a_i}\left[\frac{1}{n}\sum_{j=1}^n (a_j^3)^2\right] = \frac{6 a_i^5}{n} = \frac{3 a_i^5}{2}, \quad \text{since } n = 4 $$

In [25]:
z.backward()
print(a.grad)
print((3*(a**5))/2)
tensor([[4.9037e-03, 2.2772e+00],
        [2.7007e+01, 7.5611e+00]])
tensor([[4.9037e-03, 2.2772e+00],
        [2.7007e+01, 7.5611e+00]], grad_fn=<DivBackward0>)

The result from autograd's z.backward() yields the same values as the differentiation using the chain rule from calculus.
